Skip to content

[SPARK-56917][TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution#55947

Closed
zhengruifeng wants to merge 16 commits into
apache:masterfrom
zhengruifeng:SC-229895-connect-col-tests
Closed

[SPARK-56917][TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution#55947
zhengruifeng wants to merge 16 commits into
apache:masterfrom
zhengruifeng:SC-229895-connect-col-tests

Conversation

@zhengruifeng
Copy link
Copy Markdown
Contributor

@zhengruifeng zhengruifeng commented May 18, 2026

What changes were proposed in this pull request?

This PR widens test coverage of how a tagged DataFrame column reference (df.col / df["col"], which carries the source DataFrame's plan id) resolves after a range of operators, pinning the behavior across Spark Classic and both modes of spark.sql.analyzer.strictDataFrameColumnResolution on Spark Connect.

The new tests are added to the shared ColumnTestsMixin in python/pyspark/sql/tests/test_column.py, so each one runs in three environments:

  • ColumnTests - Classic (default strictDataFrameColumnResolution=true),
  • ColumnParityTests - Connect strict,
  • ColumnParityTestsWithNonStrictDFColResolution - Connect lenient (extends the strict suite).

The base mixin asserts the behavior shared by all three. The few cases where Connect diverges are overridden in the Connect parity suites (python/pyspark/sql/tests/connect/test_parity_column.py) rather than duplicated, so the shared cases stay mode-agnostic. This follows the base-suite + per-mode-override structure suggested in review.

Per-operator resolution tests (added to ColumnTestsMixin):

  • Pass-through (filter, sort, distinct) - all modes resolve.
  • Attribute-id propagation (groupBy().count(), pivot, intersect, temp-view roundtrip) - all modes resolve; the source attribute id flows through.
  • Removal (withColumnRenamed, drop) - all modes raise (column gone by id and by name).
  • Shadowing (withColumn chain, select + alias, agg + alias) - Classic and Connect strict raise; Connect lenient resolves the shadowed name via name-based fallback (overridden in the lenient suite).
  • Union - Classic resolves the left-side reference (Union keeps the left child's attribute ids); Connect raises CANNOT_RESOLVE_DATAFRAME_COLUMN in both modes, because Union is treated as a leaf during plan-id resolution (overridden in the strict suite, inherited by lenient).
  • Self-join - the alias form is ambiguous and raises in all modes; the documented withColumnRenamed form resolves in all modes (the renamed side is filtered out during disambiguation).
  • Cross-DataFrame illegal reference (df1.select(df2.col)) - raises in all modes; the throw is not gated by the strict/lenient switch.
  • df["*"] star expansion and sort-missing-attribute recovery - resolve in all modes.

Mixed-surface layered programs (3): each chains 4-5 transformations - semi-joins (DataFrame-API EXISTS/IN), window functions, cube aggregation, NTILE, UDFs, struct-field access - and then references the outermost layer's columns in both filter and select. These exercise plan-id propagation across interacting analyzer rules, which single-operator tests miss. All modes resolve.

Why are the changes needed?

#55531 added the strictDataFrameColumnResolution config and a single shadowing test. This PR enumerates the Connect-vs-Classic resolution behavior across shadowing variants, attribute-id-propagating operators, set operations, self-joins, and multi-operator layered pipelines. A future tightening of Connect's column resolution will then surface as a clear test failure rather than a silent regression.

Does this PR introduce any user-facing change?

No. Test-only change.

How was this patch tested?

New tests, run locally across all three suites:

python/run-tests --testnames "pyspark.sql.tests.test_column ColumnTests"
python/run-tests --testnames "pyspark.sql.tests.connect.test_parity_column ColumnParityTests"
python/run-tests --testnames "pyspark.sql.tests.connect.test_parity_column ColumnParityTestsWithNonStrictDFColResolution"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic), claude-opus-4-7

@zhengruifeng zhengruifeng changed the title [TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution [SPARK-56917][TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution May 18, 2026
@zhengruifeng zhengruifeng marked this pull request as ready for review May 18, 2026 05:00
@zhengruifeng zhengruifeng marked this pull request as draft May 18, 2026 05:53
@zhengruifeng zhengruifeng marked this pull request as ready for review May 20, 2026 08:41
@zhengruifeng
Copy link
Copy Markdown
Contributor Author

@vladimirg-db @cloud-fan would you mind taking a look? thanks

@cloud-fan
Copy link
Copy Markdown
Contributor

cloud-fan commented May 22, 2026

Shall we do some investments on the test framework first? It's not the first time we want test parity between classic and connect, for both scala and python APIs, but duplicating the test case is not a good long term solution. How can we define the base test cases in the base suite, and extend the suite in both classic and connect with their dedicated test cases?

@zhengruifeng
Copy link
Copy Markdown
Contributor Author

zhengruifeng commented May 22, 2026

@cloud-fan this PR is not just for parity (thought some tests behave the same on both connect and classic).

E.g.

    def test_resolve_after_select_alias_shadow(self):
        # Same shadowing shape as withColumn but expressed through a select
        # with alias.
        # Classic: fails.
        sdf = self.spark.sql("SELECT 1 AS c")
        with self.assertRaises(AnalysisException):
            sdf.select(sdf["c"].cast("string").alias("c")).select(sdf["c"]).collect()

        # Connect strict: fails.
        with self.connect_conf({"spark.sql.analyzer.strictDataFrameColumnResolution": True}):
            cdf = self.connect.sql("SELECT 1 AS c")
            with self.assertRaisesRegex(AnalysisException, "CANNOT_RESOLVE_DATAFRAME_COLUMN"):
                cdf.select(cdf["c"].cast("string").alias("c")).select(cdf["c"]).collect()

        # Connect lenient: succeeds via name-based fallback.
        with self.connect_conf({"spark.sql.analyzer.strictDataFrameColumnResolution": False}):
            cdf = self.connect.sql("SELECT 1 AS c")
            cdf.select(cdf["c"].cast("string").alias("c")).select(cdf["c"]).collect()

…esolution

Add parity tests in test_connect_column.py and layered programs in
test_parity_dataframe.py to lock in known Connect/Classic behavior
differences in DataFrame column resolution. Tests cover shadowing,
pass-through, aggregation, pivot, set ops, self-join, subquery-as-table
patterns under both strict and non-strict modes of
spark.sql.analyzer.strictDataFrameColumnResolution, plus three
mixed-surface layered programs that combine filters, joins, aggregations,
set ops, window functions, UDFs and temp views.

Generated-by: Claude Code (Anthropic), claude-opus-4-7
…sic baselines

Move the focused parity tests and the layered programs into a dedicated
SparkConnectColumnResolutionTests class in test_connect_column.py. Each
test now starts with a Classic baseline against self.spark to document
how Spark Classic handles the same pattern, followed by Connect strict
and Connect lenient blocks. test_parity_dataframe.py is reverted to its
prior state.

Generated-by: Claude Code (Anthropic), claude-opus-4-7
Replace the three shallow layered programs with Reyden-influenced
fixtures that mirror the patterns referenced in SC-229895
(reyden/query-tests/golden-files/layered-query-tests):

  - 4-level subquery chain with windows, HAVING and correlated EXISTS
  - CTE chain with GROUPING SETS, NTILE, struct field access and
    correlated IN
  - Self-join via SQL with windowed running totals, correlated EXISTS
    and UDF wrapping

Each program builds the deeply layered base via spark.sql(), then
layers DataFrame-API shadowing on top with a tagged df["c"] reference
at the outermost select. Classic and Connect strict raise; Connect
lenient succeeds via name-based fallback.

Generated-by: Claude Code (Anthropic), claude-opus-4-7
Generated-by: Claude Code (Anthropic), claude-opus-4-7
The Connect-specific AnalysisException subclasses the base one, so a
single import covers both. The assertRaisesRegex regex still pins the
Connect-specific CANNOT_RESOLVE_DATAFRAME_COLUMN error class.

Generated-by: Claude Code (Anthropic), claude-opus-4-7
…y assertions

- Rewrite the three mixed-surface layered tests to use only the DataFrame API
  (chained transformations, semi-joins, Window functions, cube, UDFs and
  struct field access). spark.sql() is no longer used in the layered tests.
- After running the suite locally, correct several assertions that assumed
  Classic-vs-Connect divergence where none exists: groupBy, intersect,
  pivot, and temp-view-roundtrip all preserve attribute-id propagation
  through Connect's plan-id resolver, so the tagged reference resolves in
  both strict and lenient modes.
- For union the divergence does hold: Classic succeeds but Connect raises
  CANNOT_RESOLVE_DATAFRAME_COLUMN in both strict and lenient modes (the
  name-based fallback is not triggered for set-op outputs).

Generated-by: Claude Code (Anthropic), claude-opus-4-7
Rename df / df1 / df2 variables that refer to Connect DataFrames in
SparkConnectColumnResolutionTests to cdf / cdf1 / cdf2, matching the
existing sdf convention for Classic. Drop the per-test pyspark.sql.types
imports from the layered tests; the needed types are already imported
at the top of the file.

Generated-by: Claude Code (Anthropic), claude-opus-4-7
Based on the resolution logic in
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala
add four tests covering behaviors not previously exercised:

- test_resolve_cross_dataframe_illegal_reference: the documented
  df1.select(df2.a) case where df2's plan id is not in df1's plan tree;
  fails with CANNOT_RESOLVE_DATAFRAME_COLUMN before any name-based
  fallback can run.
- test_resolve_df_star: plan-id-tagged star expansion via
  UnresolvedDataFrameStar; succeeds in both Classic and Connect modes.
- test_resolve_self_join_withcolumnrenamed: the documented self-join
  disambiguation example where the right-side candidate is filtered out
  by the rename projection above it.
- test_resolve_sort_missing_attr_recovery: the documented
  df.select(df.v).sort(df.id) case where Sort's reference is recovered
  by resolveExprsAndAddMissingAttrs adding id back to the upstream
  projection, in both strict and lenient modes.

Tighten comments on two existing tests:

- test_resolve_after_union: cite that Union is treated as a leaf node
  when walking the plan tree for plan-id resolution.
- test_resolve_self_join_alias: cite AMBIGUOUS_COLUMN_REFERENCE as the
  specific failure mode.

Generated-by: Claude Code (Anthropic), claude-opus-4-7
Reframe the 3 layered DataFrame tests around ``layered[col]`` usage in
filter and select at the outermost surface (no synthetic shadow on top),
and move them into the shared ``ColumnTestsMixin`` so they run under
Classic (``ColumnTests``), Connect strict (``ColumnParityTests``), and
Connect lenient (``ColumnParityTestsWithNonStrictDFColResolution``) via
the existing parity framework instead of building separate Classic and
Connect arms in one test body.

Generated-by: Claude Code (Anthropic), claude-opus-4-7
Reference DataFrame columns via the ``df.col_name`` getattr form instead
of ``sf.col(...)`` / ``df["col"]`` in the three layered tests, binding the
post-agg intermediates to variables so a DataFrame is in scope for the
HAVING-style filters. Struct subfields keep bracket access because
``detail.name`` would resolve to ``Column.name``.

Generated-by: Claude Code (Anthropic), claude-opus-4-8
Move the tagged DataFrame column-resolution tests out of the Connect-only
SparkConnectColumnResolutionTests and into the shared ColumnTestsMixin, so
each runs under Classic (ColumnTests), Connect strict (ColumnParityTests),
and Connect lenient (ColumnParityTestsWithNonStrictDFColResolution).

The mixin body asserts the behavior shared by Classic and Connect strict.
The diverging cases are overridden in the parity suites:
- union: ColumnParityTests asserts Connect raises CANNOT_RESOLVE_DATAFRAME_COLUMN
  in both modes (Classic resolves via attribute-id propagation in the mixin);
- the shadowing trio: the lenient suite asserts name-based fallback succeeds.

Tagged references use the df.col getattr form; df["*"] keeps bracket access
and untagged transform expressions keep sf.col. Drops the now-unused
SparkConnectColumnResolutionTests and its AnalysisException import.

Generated-by: Claude Code (Anthropic), claude-opus-4-8
@zhengruifeng zhengruifeng force-pushed the SC-229895-connect-col-tests branch from ad6758b to 29f715b Compare May 29, 2026 11:00
zhengruifeng and others added 5 commits May 29, 2026 12:32
Move ``from pyspark.sql.window import Window`` out of the three layered
test methods to the module-level imports. Add a comment on each shadowing
resolution test noting that Connect lenient mode diverges (name-based
fallback succeeds, overridden in the lenient parity suite).

Generated-by: Claude Code (Anthropic), claude-opus-4-8
df.c on a DataFrame with only column `x` raises PySparkAttributeError
before reaching the analyzer. Use df["c"] so the tagged UnresolvedAttribute
is built eagerly and the AnalysisException is thrown at analysis time, which
is what this test pins. Same fix applied in the lenient parity override.

Generated-by: Claude Code (Anthropic), claude-opus-4-8
SELECT 1 AS x gave a df with no column c, so df.c raised
PySparkAttributeError before Spark analysis. Change the source to
SELECT 1 AS c and shadow via agg(sum("c").alias("c")) so df.c is valid
and the tagged reference is aggregated away - same semantics, correct form.

Generated-by: Claude Code (Anthropic), claude-opus-4-8
Both Union and Intersect preserve the left child's exprId
(Union/Intersect.mergeChildOutputs), so neither "emits new attribute ids".
The actual Connect divergence is that Union is treated as a leaf during
plan-id resolution (ColumnResolutionHelper) while Intersect is not.

Co-authored-by: Isaac
…L) group

CUBE produces subtotal groups; ROLLUP is a distinct grouping operator, so
'subtotal' describes the (category, status, NULL) group more precisely.

Co-authored-by: Isaac
@cloud-fan
Copy link
Copy Markdown
Contributor

thanks, merging to master/4.x/4.2 (test only)

@cloud-fan cloud-fan closed this in 37fcee4 May 31, 2026
cloud-fan added a commit that referenced this pull request May 31, 2026
…ame column resolution

### What changes were proposed in this pull request?

This PR widens test coverage of how a tagged DataFrame column reference (`df.col` / `df["col"]`, which carries the source DataFrame's plan id) resolves after a range of operators, pinning the behavior across Spark Classic and both modes of `spark.sql.analyzer.strictDataFrameColumnResolution` on Spark Connect.

The new tests are added to the shared `ColumnTestsMixin` in `python/pyspark/sql/tests/test_column.py`, so each one runs in three environments:

- `ColumnTests` - Classic (default `strictDataFrameColumnResolution=true`),
- `ColumnParityTests` - Connect strict,
- `ColumnParityTestsWithNonStrictDFColResolution` - Connect lenient (extends the strict suite).

The base mixin asserts the behavior shared by all three. The few cases where Connect diverges are overridden in the Connect parity suites (`python/pyspark/sql/tests/connect/test_parity_column.py`) rather than duplicated, so the shared cases stay mode-agnostic. This follows the base-suite + per-mode-override structure suggested in review.

**Per-operator resolution tests** (added to `ColumnTestsMixin`):

- Pass-through (`filter`, `sort`, `distinct`) - all modes resolve.
- Attribute-id propagation (`groupBy().count()`, `pivot`, `intersect`, temp-view roundtrip) - all modes resolve; the source attribute id flows through.
- Removal (`withColumnRenamed`, `drop`) - all modes raise (column gone by id and by name).
- Shadowing (`withColumn` chain, `select` + alias, `agg` + alias) - Classic and Connect strict raise; **Connect lenient** resolves the shadowed name via name-based fallback (overridden in the lenient suite).
- Union - Classic resolves the left-side reference (Union keeps the left child's attribute ids); **Connect** raises `CANNOT_RESOLVE_DATAFRAME_COLUMN` in both modes, because Union is treated as a leaf during plan-id resolution (overridden in the strict suite, inherited by lenient).
- Self-join - the alias form is ambiguous and raises in all modes; the documented `withColumnRenamed` form resolves in all modes (the renamed side is filtered out during disambiguation).
- Cross-DataFrame illegal reference (`df1.select(df2.col)`) - raises in all modes; the throw is not gated by the strict/lenient switch.
- `df["*"]` star expansion and sort-missing-attribute recovery - resolve in all modes.

**Mixed-surface layered programs** (3): each chains 4-5 transformations - semi-joins (DataFrame-API EXISTS/IN), window functions, cube aggregation, NTILE, UDFs, struct-field access - and then references the outermost layer's columns in both `filter` and `select`. These exercise plan-id propagation across interacting analyzer rules, which single-operator tests miss. All modes resolve.

### Why are the changes needed?

#55531 added the `strictDataFrameColumnResolution` config and a single shadowing test. This PR enumerates the Connect-vs-Classic resolution behavior across shadowing variants, attribute-id-propagating operators, set operations, self-joins, and multi-operator layered pipelines. A future tightening of Connect's column resolution will then surface as a clear test failure rather than a silent regression.

### Does this PR introduce _any_ user-facing change?

No. Test-only change.

### How was this patch tested?

New tests, run locally across all three suites:

```
python/run-tests --testnames "pyspark.sql.tests.test_column ColumnTests"
python/run-tests --testnames "pyspark.sql.tests.connect.test_parity_column ColumnParityTests"
python/run-tests --testnames "pyspark.sql.tests.connect.test_parity_column ColumnParityTestsWithNonStrictDFColResolution"
```

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic), claude-opus-4-7

Closes #55947 from zhengruifeng/SC-229895-connect-col-tests.

Lead-authored-by: Ruifeng Zheng <ruifengz@apache.org>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 37fcee4)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request May 31, 2026
…ame column resolution

### What changes were proposed in this pull request?

This PR widens test coverage of how a tagged DataFrame column reference (`df.col` / `df["col"]`, which carries the source DataFrame's plan id) resolves after a range of operators, pinning the behavior across Spark Classic and both modes of `spark.sql.analyzer.strictDataFrameColumnResolution` on Spark Connect.

The new tests are added to the shared `ColumnTestsMixin` in `python/pyspark/sql/tests/test_column.py`, so each one runs in three environments:

- `ColumnTests` - Classic (default `strictDataFrameColumnResolution=true`),
- `ColumnParityTests` - Connect strict,
- `ColumnParityTestsWithNonStrictDFColResolution` - Connect lenient (extends the strict suite).

The base mixin asserts the behavior shared by all three. The few cases where Connect diverges are overridden in the Connect parity suites (`python/pyspark/sql/tests/connect/test_parity_column.py`) rather than duplicated, so the shared cases stay mode-agnostic. This follows the base-suite + per-mode-override structure suggested in review.

**Per-operator resolution tests** (added to `ColumnTestsMixin`):

- Pass-through (`filter`, `sort`, `distinct`) - all modes resolve.
- Attribute-id propagation (`groupBy().count()`, `pivot`, `intersect`, temp-view roundtrip) - all modes resolve; the source attribute id flows through.
- Removal (`withColumnRenamed`, `drop`) - all modes raise (column gone by id and by name).
- Shadowing (`withColumn` chain, `select` + alias, `agg` + alias) - Classic and Connect strict raise; **Connect lenient** resolves the shadowed name via name-based fallback (overridden in the lenient suite).
- Union - Classic resolves the left-side reference (Union keeps the left child's attribute ids); **Connect** raises `CANNOT_RESOLVE_DATAFRAME_COLUMN` in both modes, because Union is treated as a leaf during plan-id resolution (overridden in the strict suite, inherited by lenient).
- Self-join - the alias form is ambiguous and raises in all modes; the documented `withColumnRenamed` form resolves in all modes (the renamed side is filtered out during disambiguation).
- Cross-DataFrame illegal reference (`df1.select(df2.col)`) - raises in all modes; the throw is not gated by the strict/lenient switch.
- `df["*"]` star expansion and sort-missing-attribute recovery - resolve in all modes.

**Mixed-surface layered programs** (3): each chains 4-5 transformations - semi-joins (DataFrame-API EXISTS/IN), window functions, cube aggregation, NTILE, UDFs, struct-field access - and then references the outermost layer's columns in both `filter` and `select`. These exercise plan-id propagation across interacting analyzer rules, which single-operator tests miss. All modes resolve.

### Why are the changes needed?

#55531 added the `strictDataFrameColumnResolution` config and a single shadowing test. This PR enumerates the Connect-vs-Classic resolution behavior across shadowing variants, attribute-id-propagating operators, set operations, self-joins, and multi-operator layered pipelines. A future tightening of Connect's column resolution will then surface as a clear test failure rather than a silent regression.

### Does this PR introduce _any_ user-facing change?

No. Test-only change.

### How was this patch tested?

New tests, run locally across all three suites:

```
python/run-tests --testnames "pyspark.sql.tests.test_column ColumnTests"
python/run-tests --testnames "pyspark.sql.tests.connect.test_parity_column ColumnParityTests"
python/run-tests --testnames "pyspark.sql.tests.connect.test_parity_column ColumnParityTestsWithNonStrictDFColResolution"
```

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic), claude-opus-4-7

Closes #55947 from zhengruifeng/SC-229895-connect-col-tests.

Lead-authored-by: Ruifeng Zheng <ruifengz@apache.org>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 37fcee4)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@zhengruifeng zhengruifeng deleted the SC-229895-connect-col-tests branch June 2, 2026 12:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants